A Near-Optimal Best-of-Both-Worlds Algorithm for Online Learning with Feedback Graphs
We present a computationally efficient algorithm for learning in this framework that simultaneously achieves near-optimal regret bounds in both stochastic and adversarial environments. The bound against oblivious adversaries is $\tilde{O}(\sqrt{\alpha T})$, where $T$ is the time horizon and $\alpha$ is the independence number of the feedback graph.
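To make the role of the independence number concrete, here is a minimal sketch that computes it by brute force for a small undirected graph and evaluates the resulting rate. It assumes the adversarial rate scales as sqrt(alpha * T) up to logarithmic factors; the 5-cycle graph, the horizon, and the function name `independence_number` are illustrative choices, not taken from the paper.

```python
from itertools import combinations
from math import sqrt

def independence_number(n_nodes, edges):
    """Brute-force size of the largest set of pairwise non-adjacent nodes."""
    adjacent = {frozenset(e) for e in edges}
    # Try the largest candidate sizes first and stop at the first independent set.
    for size in range(n_nodes, 0, -1):
        for subset in combinations(range(n_nodes), size):
            if all(frozenset(pair) not in adjacent
                   for pair in combinations(subset, 2)):
                return size
    return 0

# Hypothetical feedback graph over 5 actions: the cycle 0-1-2-3-4-0.
cycle_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
alpha = independence_number(5, cycle_edges)

# Assumed horizon; the rate ignores constants and logarithmic factors.
T = 10_000
rate = sqrt(alpha * T)
print(alpha, rate)
```

The brute-force search is exponential in the number of actions, so it is only suitable for toy graphs like this one.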
- Europe > Italy (0.04)
- Europe > Denmark (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East > Jordan (0.05)
- South America > Paraguay > Asunción > Asunción (0.04)
- North America > United States (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
c74214a3877c4d8297ac96217d5189b7-Paper.pdf
However, the resulting methods often suffer from high computational complexity, which has reduced their practical applicability. For example, in the case of multiclass logistic regression, the aggregating forecaster (Foster et al. (2018)) achieves a regret of $O(\log(Bn))$ whereas Online Newton Step achieves $O(e^{B}\log(n))$, obtaining a double exponential gain in $B$ (a bound on the norm of comparative functions).
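The size of the gap between the two rates is easy to check numerically. The sketch below drops all constants and reads the second rate as e^B * log(n); the values B = 5 and n = 1000 and both function names are assumptions made for illustration.

```python
from math import exp, log

def aggregating_forecaster_rate(B, n):
    """Rate log(B * n), constants ignored."""
    return log(B * n)

def online_newton_step_rate(B, n):
    """Rate exp(B) * log(n), constants ignored."""
    return exp(B) * log(n)

# Assumed norm bound and number of rounds, chosen only for illustration.
B, n = 5, 1000
fast = aggregating_forecaster_rate(B, n)
slow = online_newton_step_rate(B, n)
print(f"log(Bn) = {fast:.2f}, e^B log(n) = {slow:.2f}, ratio = {slow / fast:.1f}")
```

Since one rate depends on B through log(B) and the other through e^B, the ratio grows quickly as B increases.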
- Europe > France > Île-de-France > Paris > Paris (0.05)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.04)
- North America > United States > Arizona (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Europe > Hungary > Budapest > Budapest (0.04)
219ece62fae865562d4510ea501cf349-Supplemental.pdf
If there are multiple LVs, we select the LV with the maximum probability p(W). This is a heuristic suggested by [13] to improve empirical performance. The simulated robot pushing experiment is taken from [23]. The simulation returns the location of a pushed object given the robot's location and the pushing duration, i.e., x. The portfolio optimization problem is taken from [4].
Monte-Carlo Tree Search by Best Arm Identification
Kaufmann, Emilie, Koolen, Wouter
We consider two-player zero-sum turn-based interactions, in which the sequence of possible successive moves is represented by a maximin game tree T. This tree models the possible action sequences by a collection of MAX nodes, which correspond to states in the game in which player A should take action, MIN nodes, for states in the game in which player B should take action, and leaves which specify the payoff for player A. The goal is to determine the best action at the root for player A. For deterministic payoffs this search problem is primarily algorithmic, with several powerful pruning strategies available [20]. We look at problems with stochastic payoffs, which in addition present a major statistical challenge. Sequential identification questions in game trees with stochastic payoffs arise naturally as robust versions of bandit problems. They are also a core component of Monte Carlo tree search (MCTS) approaches for solving intractably large deterministic tree search problems, where an entire sub-tree is represented by a stochastic leaf in which randomized play-outs and/or evaluations are performed [4]. A play-out consists of finishing the game with some simple, typically random, policy and observing the outcome for player A. For example, MCTS is used within the AlphaGo system [21], and the evaluation of a leaf position combines supervised learning and (smart) play-outs. While MCTS algorithms for Go have now reached expert human level, such algorithms remain very costly, in that many (expensive) leaf evaluations or play-outs are necessary to output the next action to be taken by the player. In this paper, we focus on the sample complexity of Monte-Carlo Tree Search methods, about which very little is known. For this purpose, we work under a simplified model for MCTS already studied by [22], which generalizes the depth-two framework of [10].
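The maximin structure described above can be sketched as a short recursion. This is not the paper's algorithm: it assumes the mean payoff of every leaf is already known, whereas the paper studies the case where leaf payoffs are stochastic and must be estimated from samples. The nested-list tree encoding, the example payoffs, and the names `value` and `best_root_action` are illustrative assumptions.

```python
def value(node, is_max):
    """Maximin value of a tree encoded as nested lists with numeric leaves."""
    if not isinstance(node, list):
        return node  # leaf: payoff for player A
    child_values = [value(child, not is_max) for child in node]
    # Player A maximizes at MAX nodes, player B minimizes at MIN nodes.
    return max(child_values) if is_max else min(child_values)

def best_root_action(tree):
    """Index of the root child with the highest maximin value (root is MAX)."""
    child_values = [value(child, False) for child in tree]
    return max(range(len(tree)), key=child_values.__getitem__)

# Hypothetical depth-two tree: three root actions, each answered by player B.
tree = [[0.4, 0.6], [0.5, 0.9], [0.1, 0.8]]
print(best_root_action(tree), value(tree, True))
```

With unknown leaf means, each comparison in this recursion would have to be replaced by a statistical decision based on sampled payoffs, which is where the sample complexity question arises.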
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Europe > France > Hauts-de-France > Nord > Lille (0.04)